Lip Movements In Non-Focal and Focal Position for Visual Speech Synthesis
Authors
Abstract
In acoustic and visual speech synthesis based on the concatenation of speech units such as demisyllables, these units are normally recorded from nonsense utterances in which the demisyllable in question is pronounced in a non-focal position. In the present investigation, the relation between lip movements in focal and non-focal position is studied and a computational model is hypothesised.

1. BACKGROUND

In speech, focal accent is signalled by a fundamental frequency manifestation and by prolonged durations in connection with that manifestation [1] [5]. The prolonged durations are, of course, present also in visual speech, but studies show that there are other differences between focused and non-focused speech in the visual domain. The fundamental frequency manifestation has been shown to have a visual equivalent in jaw opening and jaw protrusion in American English [2], and the movements of the lower lip have been found to have an increased amplitude in a stressed context in American English [3].

In speech synthesis based on concatenation of natural speech, units such as diphones or polyphones are put together to form any word or sentence [6]. To allow control of the fundamental frequency of the synthesised speech, the recorded speech units should have a more or less constant fundamental frequency and are therefore recorded in a non-focal position. One method of adding a talking head to this kind of synthesis is to record, simultaneously with the speech signal, the facial movements represented by points on the face, to concatenate the visual segments in the same way as the acoustic ones, and to let the recorded movements control the movements of an animated face [4]. Since all units are recorded in a non-focal position, the recorded speech movements are rather neutral.

The aim of the present investigation is to study the relation between lip movements in non-focal and focal position, in order to find a computational model that can be used to convert recorded non-focal movements into focal movements in visual speech synthesis.

2. EXPERIMENTAL DESIGN

A sentence consisting of four monosyllabic words was read aloud by one speaker and recorded. The sentence was presented to the speaker 32 times; each time, one word in the sentence was replaced by the nonsense word bab, and one word was to be prosodically focused, either bab itself or another, non-adjacent word. The base sentence was Per kan gå fort (Per can walk fast) and the variations were:

bab kan GÅ fort
PER kan bab fort
BAB kan gå fort
Per kan BAB fort
Per bab gå FORT
Per KAN gå bab
Per BAB gå fort
Per kan gå BAB

where capitalisation means that the word should carry focal accent. Eight samples of /bɑːb/ were recorded in each word position: four with focal accent and four with non-focal accent (each of the eight sentence variants was thus presented four times). The stimuli were presented in random order, and the subject was instructed to speak clearly.

During the recording, four hemispherical markers made of a reflecting material were attached to the subject's lips according to figure 1. The markers were tracked at 60 Hz throughout the session with the Qualisys MacReflex motion tracking system [7], and their movements in three dimensions were stored together with the recorded acoustic signal. The distance Dy between the upper and lower lip and the distance Dx between the corners of the mouth were observed for all samples of /bɑːb/. Lip protrusion was not included in this study.
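As a rough illustration of the measurement step, the sketch below computes Dy and Dx per frame from tracked 3D marker positions. The function name, the marker labelling and the array layout are assumptions made for the example; the original Qualisys export format is not described in the text.

```python
# Illustrative sketch only: per-frame lip distances from 3D marker tracks.
# Marker naming and array layout are assumptions, not the original data format.
import numpy as np

def lip_distances(upper_lip, lower_lip, left_corner, right_corner):
    """Each argument is an (n_frames, 3) array of x, y, z positions for one
    marker, sampled at 60 Hz. Returns per-frame Dy and Dx as 1-D arrays."""
    dy = np.linalg.norm(upper_lip - lower_lip, axis=1)        # lip-opening distance Dy
    dx = np.linalg.norm(right_corner - left_corner, axis=1)   # mouth-width distance Dx
    return dy, dx
```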
The transform between two curves or numbers is defined as their quotient. In order to obtain the transform between a non-focused movement and a focused movement, the movements therefore have to be divisible by each other. All curves describing the distance Dy throughout an uttered /bɑːb/ were interpolated and stretched to the same length. A mean value was formed of all curves representing focal position (the F-curves) and of all curves representing non-focal position (the N-curves). The quotient Q (see Eq 1) between these two mean values was formed, and a polynomial Pq(t) of degree 5 was fitted to Q. This polynomial Pq(t) could then be used to transform an N-curve into an F-curve of the same duration. In order to see how well the transform based on data from all word positions worked for the four word positions separately, it was interpolated (Pqi(t)) and compressed to the right duration and applied to the mean value of the N-curves within each word position (see Eq 2). The result was compared to the mean value of the F-curves within the respective word position, and the distance d was calculated (see Eq 3).
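Eq 1–3 themselves are not reproduced in this text. A plausible reading of the description is Q(t) = Fmean(t) / Nmean(t) for Eq 1, a predicted F-curve obtained as Pqi(t) · Nmean,i(t) for Eq 2, and d as a distance between the predicted and measured mean F-curves for Eq 3. The sketch below implements that reading; the 100-point common time base, the least-squares polynomial fit and the RMS form of d are assumptions, not the paper's stated choices.

```python
# Illustrative sketch of the transform pipeline described above.
# The 100-point time base, np.polyfit least-squares fit and RMS distance
# are assumptions; the paper's exact Eq 1-3 are not reproduced here.
import numpy as np

def resample(curve, n=100):
    """Linearly interpolate a Dy curve onto n equally spaced points in [0, 1]."""
    t_old = np.linspace(0.0, 1.0, len(curve))
    t_new = np.linspace(0.0, 1.0, n)
    return np.interp(t_new, t_old, curve)

def fit_transform(f_curves, n_curves, degree=5, n=100):
    """Eq 1 (assumed form): quotient Q of the mean F-curve and mean N-curve,
    followed by a degree-5 polynomial Pq(t) fitted to Q."""
    f_mean = np.mean([resample(c, n) for c in f_curves], axis=0)
    n_mean = np.mean([resample(c, n) for c in n_curves], axis=0)
    t = np.linspace(0.0, 1.0, n)
    q = f_mean / n_mean
    return np.polyfit(t, q, degree)          # coefficients of Pq(t)

def apply_transform(coeffs, n_curve):
    """Eq 2 (assumed form): evaluate Pq(t) on the time base of an N-curve
    (i.e. stretch or compress it to that duration) and multiply pointwise."""
    t = np.linspace(0.0, 1.0, len(n_curve))
    return np.polyval(coeffs, t) * n_curve

def distance(predicted, f_curve, n=100):
    """Eq 3 (assumed form): RMS distance d between the transformed N-curve
    and the measured mean F-curve, on a common time base."""
    return np.sqrt(np.mean((resample(predicted, n) - resample(f_curve, n)) ** 2))
```

Used per word position, fit_transform would be run on the pooled data from all positions, apply_transform on that position's mean N-curve, and distance compared against that position's mean F-curve.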
Similar papers
Focal Accent and Facial Movements in Expressive Speech
In this paper, we present measurements of visual, facial parameters obtained from a speech corpus consisting of short, read utterances in which focal accent was systematically varied. The utterances were recorded in a variety of expressive modes including Certain, Confirming, Questioning, Uncertain, Happy, Angry and Neutral. Results showed that in all expressive modes, words with focal accent a...
A multi-measurement approach to the identification of the audiovisual facial correlates of contrastive focus in French
The aims of this study are twofold. We first seek to validate a previous study conducted on one speaker. It identified lower face visible articulatory correlates of contrastive focus in French for real (nonreiterant) speech. We thus conducted the same study for another speaker. Our second goal was to enlarge the set of cues measured to other facial movements. To do so we used a complementary me...
Parameterisation of Speech Lip Movements
In this paper we describe a parameterisation of lip movements which maintains the dynamic structure inherent in the task of producing speech sounds. A stereo capture system is used to reconstruct 3D models of a speaker producing sentences from the TIMIT corpus. This data is mapped into a space which maintains the relationships between samples and their temporal derivatives. By incorporating dyn...
HMM-based text-to-audio-visual speech synthesis
This paper describes a technique for text-to-audio-visual speech synthesis based on hidden Markov models (HMMs), in which lip image sequences are modeled with an image- or pixel-based approach. To reduce the dimensionality of the visual speech feature space, we obtain a set of orthogonal vectors (eigenlips) by principal components analysis (PCA), and use a subset of the PCA coefficients and their dy...
Parameterisation of 3d speech lip movements
In this paper we describe a parameterisation of lip movements which maintains the dynamic structure inherent in the task of producing speech sounds. A stereo capture system is used to reconstruct 3D models of a speaker producing sentences from the TIMIT corpus. This data is mapped into a space which maintains the relationships between samples and their temporal derivatives. By incorporating dyn...